DOC: User Guide Page on user-defined functions #61195
Currently writing this, so I would appreciate any feedback on it!
Thanks for the PR! I'm not opposed to a dedicated page on UDFs, but I am opposed to duplicating documentation that exists elsewhere in the user guide, as I think much of this does. Instead of e.g. examples of `apply`, I recommend linking to the appropriate section. This page can then focus on recommendations of when to use apply vs other methods.
Why Use User-Defined Functions?
-------------------------------
I think we should lead with `Why _not_ User-Defined Functions`. While performance is called out down below, I think the poor behavior of UDFs should be mentioned as well. Namely that pandas has no information on what a UDF is doing, and so has to infer (guess) at how to handle the result.
In particular, I think it should be mentioned that none of the examples on this page should be UDFs in practice.
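A minimal sketch of the point about inference, assuming a made-up frame and a hypothetical `col_total` function that only duplicates a built-in reduction. pandas has to call the Python function on each column and work out the result shape from what comes back, whereas the built-in already knows it is a reduction:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

def col_total(col):
    # Hypothetical UDF: pandas cannot know in advance that this returns a scalar.
    return col.sum()

via_udf = df.apply(col_total)  # result shape is inferred from the return values
via_builtin = df.sum()         # same numbers, no inference needed

print(via_udf.tolist() == via_builtin.tolist())
```

This is exactly the kind of UDF that should not be written in practice, since `df.sum()` already does the job.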
Hi @rhshadrach, thanks for the feedback! I agree with you and will push updates soon.
I think this is looking a lot better. Can we also link to https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation at the very bottom in a section titled something like "Improving Performance with UDFs".
ways to apply UDFs across different pandas data structures.

.. note::
    Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.
Can you also mention resample, rolling, expanding, and ewm? Perhaps link to each section in the User Guide.
Can we add the other objects to this note? It seems to me they all belong together.
Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.
Some of these methods can also be applied to groupby, resample, and various window objects. See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`, and :ref:`ewm()<window>` for details.
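As a rough illustration of what the note refers to, here is a sketch that passes one hypothetical UDF, `value_range`, to groupby, resample, rolling, and expanding objects. The data and group labels are made up. `ewm` is left out because, as far as I know, it does not expose an `apply` method that takes an arbitrary Python callable:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0], index=idx)
labels = ["a", "b", "a", "b", "a", "b"]  # made-up group labels

def value_range(x):
    # Hypothetical UDF: spread between the largest and smallest value.
    return x.max() - x.min()

by_group = s.groupby(labels).agg(value_range)        # one value per group
by_resample = s.resample("2D").apply(value_range)    # one value per 2-day bin
by_rolling = s.rolling(window=3).apply(value_range)  # one value per 3-row window
by_expanding = s.expanding().apply(value_range)      # one value per growing window
```

The same function works in all four places because each object hands the UDF a chunk of values and expects a scalar back.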
pandas comes with a set of built-in functions for data manipulation, UDFs offer
flexibility when built-in methods are not sufficient. These functions can be
applied at different levels: element-wise, row-wise, column-wise, or group-wise,
and change the data differently, depending on the method used.
nit: "change the data differently" sounds very close to mutating in a UDF, which we explicitly do not support. What do you think of "behave differently"?
“Behave differently” sounds clearer and avoids implying mutation. I'll update it!
* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series,
  DataFrames, or groups of data.
I'm thinking we should remove `groups of data` here. `DataFrame.apply` that you're referencing doesn't operate on groups, and you mention groupby below.
When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
functions across groups.
Things like `.agg(["sum", "mean"])` aren't UDFs, so I don't think they should be mentioned here, and it could make users think these types of usages are slow (they are not).
When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
functions across groups.
When to use: Use :meth:`DataFrame.agg` for performing custom aggregations, where the operation returns a scalar value on each input.
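A small sketch of what "custom aggregation returning a scalar" could look like. The column names and the `spread` function are hypothetical, chosen only to show a UDF that reduces each column to one value:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 5, 3], "two": [10, 40, 20]})

def spread(col):
    # Hypothetical custom aggregation: reduces a column to a single scalar.
    return col.max() - col.min()

result = df.agg(spread)  # one scalar per column, returned as a Series
```

By contrast, `df.agg(["sum", "mean"])` uses built-in reductions by name and is not a UDF.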
})

# Using transform with mean
df['Mean_Transformed'] = df.groupby('Category')['Values'].transform('mean')
This isn't an example of a UDF. I really like your example of using linear regression - can we do that here? It's a bit unfortunate that groupby.transform does not allow operating on the entire group (only works column-by-column) here.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x': [1, 2, 3, 1, 2, 3],
    'y': [2, 4, 6, 1, 2, 1.5]
}).set_index("x")

# Function to fit a model to each group
def fit_model(group):
    # transform passes one column per group, so x is taken from the index
    x = group.index.to_frame()
    y = group
    model = LinearRegression()
    model.fit(x, y)
    pred = model.predict(x)
    return pred

result = df.groupby('group').transform(fit_model)
Excellent job here @arthurlw, thanks for taking care of this. I added a general comment about using examples to incrementally illustrate what is explained here, and about changing the order of the sections a bit.
Please let me know if it doesn't make sense or you have any comments. I'll review more in depth after the proposed changes are implemented or discussed. But at first look, this is really nice.
doc/source/user_guide/index.rst
Outdated
@@ -88,3 +88,4 @@ Guides
    sparse
    gotchas
    cookbook
    user_defined_functions
I would move this before the `groupby` section. It feels more natural to me to explain first `Series.apply` and later explain `groupby("col").apply`.
and change the data differently, depending on the method used.

Why Not To Use User-Defined Functions
-----------------------------------------
Not sure if Sphinx is more flexible now, but this had to be the same exact length as the title before.
Title marker needs to be at least as long as the text, but can be longer.
{{ header }}

**************************************
Introduction to User-Defined Functions
Introduction to User-Defined Functions
User-Defined Functions (UDFs)
This is what will be shown in the index too, so it's better to be concise. Also, for consistency it's better to remove the `Introduction to`, which we could otherwise have in every other user guide too.
applied at different levels: element-wise, row-wise, column-wise, or group-wise,
and change the data differently, depending on the method used.

Why Not To Use User-Defined Functions
Maybe just personal opinion, but to me it makes more sense to explain what UDFs are in pandas before explaining when not to use them. This order seems reasonable assuming users already know what pandas udfs are in practice, but I'd personally prefer not to assume it in the user guide for UDFs.
In my opinion, after the previous introduction which is great, I'd show a very simple example so we make sure users reading this understand the very basics.
Something like:
import pandas as pd

def add_one(x):
    return x + 1

my_series = pd.Series([1, 2, 3])
my_series.map(add_one)
Building on top of this, like then showing the same with a `DataFrame`, at some point showing UDFs that receive the whole column with `.apply`... should help make sure users are following and understanding all the information provided here.
I am a bit negative here. This is duplicating a lot of other documentation that we already have. I think we should instead link to that documentation.
Do you mind pointing to a specific example @rhshadrach? I found documentation for the aggregate functions, but not much for `map`, `apply`... on `Series` and `DataFrame` other than in the API docs. I agree with not having much duplication. Personally, if there is a little here and there, like in the FAQs, Performance page..., I'd rather have the docs related to these methods in this page, as it feels like the natural place, and link to the sections here from the FAQs, performance hints, groupby user guide... Of course there can be cases where the opposite makes more sense, but maybe we can discuss the specific cases where there is duplication.
apply: https://pandas.pydata.org/docs/user_guide/basics.html#row-or-column-wise-function-application
map: https://pandas.pydata.org/docs/user_guide/basics.html#applying-elementwise-functions
> I'd rather have the docs related to these methods in this page, as it feels like the natural place
If we are going to move the docs on e.g. `DataFrame.agg` here, then this no longer is a page just about UDFs as `DataFrame.agg` does more than just use UDFs. In addition, that seems like a large reworking of the docs for little (in my opinion, actually negative) benefit.
I totally missed the Essential basic functionality page, thanks for pointing that out. Fully agree with you that what I proposed here repeats the whole https://pandas.pydata.org/docs/user_guide/basics.html#function-application section. And I agree that's not a good idea.
Personally, I'd rather not have that section, and have that content here. At least in my experience, map and apply are common, but not as essential as the other parts described in that page. And also, I think the structure of the user guide will be clearer and things will be easier to find with the changes.
For `DataFrame.agg`, there is already a groupby page, and I think just having the methods in the lists of methods that support UDFs would be good, with a mention that points to the groupby page where the detailed explanation of grouping is presented with examples.
There may be other structures, but what I'd like is to give users some structure for the related methods. I think `Series` has around 200 methods and attributes. Users having to navigate that whole API to find out themselves that map, apply and pipe are kind of the same, just changing the input of the UDF, doesn't seem ideal. I think this page here can really help with that.
What do you think?
> Personally, I'd rather not have that section, and have that content here.
If we move the main document of `apply` here, then I am quite opposed to calling this a page on UDFs as apply does more than just take UDFs. By documenting `apply("sum")` et al here, it seems to me we make this page far less clear than leaving it as solely UDFs.
In any case, is that something you think should be tackled in this PR? This PR started as
> A dedicated page in the users guide that guides users on when to use udf, a general idea of the API, the differences between the different methods, the options available... seems a better idea.
I do not think we should morph it into moving around documentation from other places, especially when there are disagreements.
> Users having to navigate that whole API to find out themselves that map, apply and pipe are kind of the same just changing the input of the udf, doesn't seem ideal.
Which is why I think this page should be a comparison of UDF methods (as it mostly is now), while pointing to more thorough documentation elsewhere in the User Guide.
Fair enough, I think I understand your point better now. Maybe I'd like to improve a bit the apply/maps docs in essential, but that's unrelated to this PR. And happy to move forward here focussing on the UDFs and not on the methods, as you describe.
Very nice, just a couple of small comments. And we need to decide about duplication, but in general looks great. Thanks for the work here @arthurlw.
* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series and
  DataFrames.
* :meth:`~DataFrame.agg` (Aggregate) - Used for summarizing data, supporting custom
  aggregation functions.
* :meth:`~DataFrame.transform` - Applies a function to Series and Dataframes while preserving the shape of
  the original data.
* :meth:`~DataFrame.filter` - Filters Series and Dataframes based on a list of Boolean conditions.
* :meth:`~DataFrame.map` - Applies an element-wise function to a Series or Dataframe, useful for
  transforming individual values.
* :meth:`~DataFrame.pipe` - Allows chaining custom functions to process Series or
  Dataframes in a clean, readable manner.
What do you think about having this as a table? Personally I think it should make it easier to understand the differences about the methods. As a general idea:
| method | function input | function output | description |
|---|---|---|---|
| map | scalar | scalar | map each element to the element returned by the function elementwise |
| apply(axis=0) | column | column | map each column to the column returned by the function |
| apply(axis=1) | row | row | map each row to the row returned by the function |
| pipe | series or dataframe | series or dataframe | map the series or dataframe to a new series or dataframe returned by the function |
Not sure if it makes sense to combine with the table below.
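To check the "function input" column of the table empirically, a quick sketch like the following could help. The `record` helper is hypothetical and only remembers the type of whatever pandas passes to the UDF:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
seen = {}

def record(label):
    # Hypothetical helper: builds a UDF that remembers the type of its input
    # and returns the input unchanged.
    def udf(x):
        seen[label] = type(x).__name__
        return x
    return udf

df["a"].map(record("Series.map"))          # called once per element
df.apply(record("apply(axis=0)"), axis=0)  # called once per column
df.apply(record("apply(axis=1)"), axis=1)  # called once per row
df.pipe(record("pipe"))                    # called once with the whole object

print(seen)
```

Columns and rows both arrive as `Series`, `pipe` receives the `DataFrame` itself, and `map` receives individual scalars.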
Yeah I agree with you, thanks for the suggestion! I will keep the two tables separate for now
.. note::
    :meth:`DataFrame.filter` does not accept UDFs, but can accept
    list comprehensions that have UDFs applied to them.
I'm unsure on having `filter` here for now. I think it's very good that you added it, as it doesn't support udfs, but it probably should. So, it opens a discussion we probably want to have about adding them. @rhshadrach thoughts?
I suspect the reason this was added is that `DataFrameGroupBy.filter` does accept UDFs. Perhaps that should be mentioned instead?
I actually think `DataFrame.filter` should accept Boolean masks, similar to PySpark and Polars. But agreed that discussion is not for here!
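For reference, a minimal sketch of `DataFrameGroupBy.filter` taking a UDF. The data and the `has_large_total` predicate are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C"],
    "value": [1, 2, 10, 20, 5],
})

def has_large_total(group):
    # Hypothetical UDF: receives each group's DataFrame and returns a single bool.
    return group["value"].sum() > 4

kept = df.groupby("group").filter(has_large_total)  # drops every row of group "A"
```

Here the UDF decides per group, not per row, which is different from what `DataFrame.filter` does with labels.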
Documentation can be found at :meth:`~DataFrame.pipe`.


Best Practices
Maybe just personal preference, but these last 3 sections seem to be talking about the same thing (performance), so I'd have just one section about performance.
I'd keep it short for now, and we can iterate over it later. The reason is that each time we review this before merging it we need to re-read the whole document. So, if we can finish the main part above first, and have this as a placeholder, then in a second PR we can focus more on performance without having to keep reviewing the first part again.
Looks great. Some minor comments, and as you've seen, we need to make a decision on the structure to use. But I think this guide is a great addition.
| :meth:`map`                | Scalar                 | Scalar                   | Maps each element to the element returned by the function element-wise    |
+----------------------------+------------------------+--------------------------+---------------------------------------------------------------------------+
| :meth:`apply` (axis=0)     | Column (Series)        | Column (Series)          | Apply a function to each column                                           |
I think this comes from my suggestion, but checking the descriptions now, it feels like it'd be helpful to use the same terminology, to show how similar both functions are. Meaning that if we use `Apply a function to each column`, I think it'd be helpful to use `Apply a function to each element`. Feel free to disagree, and I see the point of using the name of the method in the description. But personally, I think it'd be good to highlight how similar these methods are, and let users understand the difference easily and quickly, and I think what I'm proposing should help with that.
+----------------------------+------------------------+--------------------------+---------------------------------------------------------------------------+
| :meth:`agg`                | Series/DataFrame       | Scalar or Series         | Aggregate and summarizes values, e.g., sum or custom reducer              |
+----------------------------+------------------------+--------------------------+---------------------------------------------------------------------------+
| :meth:`transform`          | Series/DataFrame       | Same shape as input      | Transform values while preserving shape                                   |
Transform is actually very similar to apply. The function input is a column or a row depending on axis. And the output is also a column or a row (exactly the same as apply). The only difference is that `apply` allows returning a shorter or longer dataframe, while transform will raise. See this example:
# apply works fine when the output still has 3 samples:
>>> pandas.DataFrame({"points": [100, 30, 50]}).apply(lambda x: pandas.Series([1, 2, 3]))
   points
0       1
1       2
2       3
# transform also works fine
>>> pandas.DataFrame({"points": [100, 30, 50]}).transform(lambda x: pandas.Series([1, 2, 3]))
   points
0       1
1       2
2       3
# apply is still happy now that we removed one of the samples:
>>> pandas.DataFrame({"points": [100, 30, 50]}).apply(lambda x: pandas.Series([1, 2]))
   points
0       1
1       2
# transform is not happy:
>>> pandas.DataFrame({"points": [100, 30, 50]}).transform(lambda x: pandas.Series([1, 2]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mgarcia/src/pandas/pandas/core/frame.py", line 10269, in transform
    result = op.transform()
             ^^^^^^^^^^^^^^
  File "/home/mgarcia/src/pandas/pandas/core/apply.py", line 356, in transform
    raise ValueError("Function did not transform")
ValueError: Function did not transform
This is more useful with groupby, and I'm not even sure if `DataFrame.transform` is that useful; it probably mostly exists for consistency.
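Since the groupby case is where this matters, here is a small sketch of a UDF passed to `groupby(...).transform`. The `team` column and the `demean` function are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "points": [100, 30, 50, 70],
})

def demean(col):
    # Hypothetical UDF: returns one value per input row, so the shape is preserved.
    return col - col.mean()

# The result is aligned with the original rows, so it can be assigned back directly.
df["points_centered"] = df.groupby("team")["points"].transform(demean)
```

In real code this particular case would be better written as `df["points"] - df.groupby("team")["points"].transform("mean")`, which avoids the UDF entirely.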
+----------------------------+------------------------+--------------------------+---------------------------------------------------------------------------+
| :meth:`transform`          | Series/DataFrame       | Same shape as input      | Transform values while preserving shape                                   |
+----------------------------+------------------------+--------------------------+---------------------------------------------------------------------------+
| :meth:`filter`             | Series/DataFrame       | Series/DataFrame         | Filter data using a boolean array                                         |
For now I'd remove the function input/output fields, as there is no function.
df["new_col"] = df.apply(calc_ratio, axis=1)

# Vectorized Operation
df["new_col2"] = 100 * (df["one"] / df["two"])
Maybe worth mentioning and comparing also `.pipe`, which is both vectorized and a udf?
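A sketch of what that comparison might look like. The body of `calc_ratio` is an assumption (the diff only shows its call site), and `add_ratio` is a made-up name for the piped UDF:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

def calc_ratio(row):
    # Assumed definition: called once per row, so Python-level overhead per row.
    return 100 * (row["one"] / row["two"])

def add_ratio(frame):
    # Hypothetical UDF for pipe: called once with the whole DataFrame,
    # and vectorized inside.
    return frame.assign(ratio=100 * (frame["one"] / frame["two"]))

row_wise = df.apply(calc_ratio, axis=1)
piped = df.pipe(add_ratio)
```

Both give the same numbers, but the `pipe` version makes a single Python call regardless of how many rows there are, and `assign` returns a new frame instead of mutating `df`.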
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest `doc/source/whatsnew/vX.X.X.rst` file if fixing a bug or adding a new feature.